
In this project, I build a demand-forecasting pipeline that combines historical rental counts with weather and time-of-day features to predict hourly bike-sharing usage. Several machine-learning algorithms, including Linear Regression, Support Vector Regression, and Gradient Boosting, are implemented and compared using 5-fold cross-validation. Each model is evaluated on RMSE, MAE, and R², and feature importances are analyzed to guide dynamic bike-redistribution strategies.
A secondary goal of this project is to learn and teach at the same time. I started it to deepen my own understanding of these methods and to share what I learned so that it might be useful to others. All resources used throughout the project are listed at the end.
Why does this matter? In recent years, many cities and businesses have turned to bike-sharing systems to improve urban mobility. Accurate demand forecasting is critical to their success. Over-supplying bikes inflates maintenance and parking costs, while under-supplying them leads to lost revenue and poor user experience. A reliable prediction model helps operators balance supply with demand, cutting unnecessary costs and maximizing profitability.
The Seoul Bike Sharing dataset contains 8,760 hourly observations spanning December 2017 to November 2018. Each record captures the number of bikes rented along with meteorological and temporal features. There are no missing values in the dataset, which makes it clean and ready for analysis right away.
Features include temperature, humidity, wind speed, visibility, dew point, solar radiation, rainfall, snowfall, season, holiday indicator, and functioning day status.
The rental count is heavily right-skewed (skewness = 1.15), with most hours recording under 1,000 rentals and a long tail stretching up to 3,556. To handle this, I applied the Box-Cox transformation from scipy.stats, which automatically finds the optimal lambda parameter. The resulting lambda of 0.35 brought the skewness almost to zero (-0.13), which is far better than a simple log transformation that actually overshoots into negative skew (-1.83). This is a great example of why Box-Cox is often preferred over log transforms for normalizing skewed distributions.
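If a model is ever trained on the Box-Cox scale, its predictions must be mapped back to rental counts. A minimal round-trip sketch (synthetic counts stand in for the rental column; scipy.special.inv_boxcox performs the inversion, and the +1 offset mirrors the shift applied before the transform):

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

# Synthetic right-skewed counts standing in for the rental column
rng = np.random.default_rng(42)
counts = rng.gamma(shape=1.5, scale=400, size=1000).round()

shifted = counts + 1                      # Box-Cox needs strictly positive input
transformed, lam = stats.boxcox(shifted)  # lam is the fitted lambda

# Round-trip: invert the transform, then undo the +1 shift
recovered = inv_boxcox(transformed, lam) - 1
print(np.allclose(recovered, counts))  # True: the transform is exactly invertible
```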
Looking at how demand changes throughout the day, we can see a clear bimodal pattern. There are two distinct peaks: one at 8 AM (morning commute) and a larger one at 6 PM (evening commute). Demand drops to its lowest point between 1 AM and 5 AM, which makes intuitive sense. The error bars show substantial variability at peak hours, meaning that while the average is high, actual demand can swing quite a bit depending on weather and whether it is a working day.
Summer is the clear winner with an average of 1,034 rentals per hour, followed by Autumn (820) and Spring (730). Winter drops dramatically to just 226 average rentals, which is less than a quarter of summer demand. This seasonal swing is something bike-sharing operators need to plan for seriously, as it affects fleet sizing, maintenance scheduling, and staffing.
Temperature is the single strongest predictor of bike rentals (Pearson r = 0.54). The scatter plot below, colored by hour of day, reveals an interesting pattern: it is not just that warmer days attract more riders, but that the combination of warm temperatures and evening hours (the purple/pink dots in the upper right) produces the highest demand spikes. This interaction effect is something tree-based models capture naturally.
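Tree-based models pick up this temperature-by-hour interaction automatically; a linear model only sees it if the interaction is supplied as an explicit feature. A small illustrative sketch on synthetic data (the is_evening flag and the coefficients are invented for the demo, not taken from the dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
temp = rng.uniform(-10, 35, size=2000)              # temperature in °C
hour = rng.integers(0, 24, size=2000)               # hour of day
is_evening = ((hour >= 17) & (hour <= 20)).astype(float)

# Demand spikes only when it is both warm AND evening (a pure interaction)
y = 200 + 30 * np.clip(temp, 0, None) * is_evening + rng.normal(0, 20, 2000)

X_main = np.column_stack([temp, is_evening])                    # main effects only
X_int = np.column_stack([temp, is_evening, temp * is_evening])  # + interaction term

r2_main = LinearRegression().fit(X_main, y).score(X_main, y)
r2_int = LinearRegression().fit(X_int, y).score(X_int, y)
print(f"R² without interaction: {r2_main:.2f}, with interaction: {r2_int:.2f}")
```

The interaction term closes most of the gap, which is the same effect that lets the tree ensembles below beat the plain linear models.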
The boxplots confirm the seasonal story and add an interesting detail about holidays. Working days actually show slightly higher median demand than holidays, likely because commute-related rides make up a large share of total usage. The spread on holidays is also wider, suggesting more unpredictable leisure usage.
To get the full picture of how every feature relates to every other feature, I created a 10x10 pairwise scatter matrix. Each off-diagonal cell shows a scatter plot between two features with the correlation coefficient displayed. The diagonal shows histograms. A few things jump out: Temperature and Dew Point are extremely correlated (r = 0.91), which signals multicollinearity. Humidity and Visibility have a strong negative relationship (r = -0.54). These insights help us understand the structure of our data before feeding it into models.
The full (symmetric) correlation heatmap below shows every pairwise Pearson correlation. The strongest predictors of bike rentals are Temperature (+0.54), Hour (+0.41), and Dew Point (+0.38). On the negative side, Humidity (-0.20), Snowfall (-0.14), and Rainfall (-0.12) all reduce demand. Notice the near-perfect correlation between Temperature and Dew Point, which is expected since dew point is physically tied to temperature.
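A standard way to quantify the Temperature/Dew Point redundancy is the variance inflation factor (VIF); rules of thumb flag values above roughly 5 to 10 as problematic for linear models. A minimal sketch on synthetic stand-ins for the columns, computing VIF directly from its definition rather than through a dedicated library:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X: np.ndarray) -> np.ndarray:
    """VIF_j = 1 / (1 - R²) from regressing column j on the other columns."""
    out = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic stand-ins: dew point tracks temperature closely (r ≈ 0.9, as in the data)
rng = np.random.default_rng(1)
temp = rng.normal(13, 12, size=5000)
dew = 0.95 * temp + rng.normal(0, 5, size=5000)    # strongly collinear with temp
humidity = rng.normal(58, 20, size=5000)           # roughly independent

X = np.column_stack([temp, dew, humidity])
print(vif(X).round(1))  # temp and dew show inflated VIFs; humidity stays near 1
```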
Before training models, I applied the following preprocessing steps to clean and enrich the data:

- Filtered out non-functioning days (Functioning Day = 'Yes'), since no rentals can occur while the system is closed.
- Parsed the Date column and derived Month, DayOfWeek, and a binary IsWeekend flag.
- Added an IsRushHour flag for the commute hours (7-9 AM and 5-7 PM).
- Label-encoded Seasons and binarized the Holiday column.
- Split the data 80/20 into train and test sets, and standardized features with StandardScaler for the scale-sensitive models (linear models, KNN, SVR).
The final feature set contains 15 features: the original numeric weather variables plus the engineered temporal features.
I trained nine different models to get a comprehensive comparison across linear, instance-based, and ensemble methods. All models were evaluated using 5-fold cross-validation on the training set, with final performance measured on the held-out test set.
| Model | RMSE | MAE | R² | CV-RMSE |
|---|---|---|---|---|
| Gradient Boosting (best) | 160.73 | 86.53 | 0.9342 | 147.41 |
| Random Forest | 172.98 | 95.78 | 0.9238 | 168.98 |
| Decision Tree | 240.89 | 131.22 | 0.8522 | 230.12 |
| K-Nearest Neighbors | 252.95 | 167.79 | 0.837 | 280.42 |
| SVR | 256.5 | 165.93 | 0.8324 | 279.66 |
| AdaBoost | 310.21 | 238.83 | 0.7549 | 320.63 |
| Lasso Regression | 395.04 | 302.19 | 0.6025 | 406.53 |
| Ridge Regression | 395.24 | 302.24 | 0.6021 | 406.5 |
| Linear Regression | 395.25 | 302.24 | 0.6021 | 406.51 |
Gradient Boosting came out on top with an R² of 0.9342 and the lowest RMSE of 160.73 among all models. Random Forest was a close second at R² = 0.9238. The linear models (Linear, Ridge, Lasso) all hovered around R² = 0.60, which tells us clearly that the relationship between features and demand is nonlinear. This is not surprising given the bimodal hourly pattern and the interactions between temperature and time of day.
The scatter plots below show how well the top three models predict actual demand. Points falling on the red dashed line represent perfect predictions. Gradient Boosting and Random Forest both track the diagonal closely, though both tend to slightly underpredict at the extreme high end (above 2,500 rentals).
The residuals for Gradient Boosting are centered around zero (mean = -4.72), which is good. The distribution looks roughly normal, though there is slight heteroscedasticity: the model's errors tend to be larger for higher predicted values. This is common in count data and could potentially be improved by modeling the log-transformed target or using a different loss function.
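One way to try the log-target idea is scikit-learn's TransformedTargetRegressor, which fits the regressor on log1p(y) and inverts predictions with expm1 automatically. A hedged sketch on synthetic count-like data with multiplicative noise (not the project's actual pipeline):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(7)
X = rng.uniform(0, 1, size=(2000, 3))
# Multiplicative noise: error grows with the mean, like the heteroscedastic residuals
y = np.exp(3 + 4 * X[:, 0]) * rng.lognormal(0, 0.3, size=2000)

model = TransformedTargetRegressor(
    regressor=GradientBoostingRegressor(random_state=42),
    func=np.log1p,          # fit on log(1 + y) ...
    inverse_func=np.expm1,  # ... predict back on the original count scale
)
model.fit(X, y)
pred = model.predict(X)
print(pred.min())  # predictions come back on the positive count scale
```

On the log scale the multiplicative errors become roughly constant-variance, which is exactly the property the residual plot above is missing.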
Temperature and Hour are by far the two most important features, together accounting for over 62% of the model's predictive power. This makes sense: temperature determines whether people want to ride, and hour determines whether people need to ride (commuting patterns). Humidity, Solar Radiation, and the IsRushHour flag form a secondary tier of influence.
Both ensemble methods agree almost perfectly on which features matter most. Temperature and Hour dominate in both models, and even the secondary features are ranked similarly. This consistency across different algorithms gives us confidence that these importance rankings reflect genuine patterns in the data, not artifacts of a single model's behavior.
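A further cross-check, beyond comparing two models' built-in importances, is permutation importance: shuffle one feature's values and measure how much the score drops. A small sketch on synthetic data (illustrative only, not run on the Seoul dataset here):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(3)
X = rng.normal(size=(1500, 3))
y = 5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, size=1500)  # column 2 is pure noise

rf = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1).fit(X, y)
result = permutation_importance(rf, X, y, n_repeats=5, random_state=42)

# Shuffling the dominant feature destroys the score; the noise column barely matters
print(result.importances_mean.round(3))
```

Because it scores the fitted model rather than counting split gains, permutation importance is less biased toward high-cardinality features and gives an independent read on the Temperature/Hour dominance.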
All code cells with their outputs and chart results are reproduced below.
```python
# Descriptive statistics
print("Descriptive Statistics:")
print(df.describe().round(2).to_string())
print(f"\nUnique Seasons: {df['Seasons'].unique().tolist()}")
print(f"Unique Holiday: {df['Holiday'].unique().tolist()}")
print(f"Unique Functioning Day: {df['Functioning Day'].unique().tolist()}")
print(f"Date range: {df['Date'].iloc[0]} to {df['Date'].iloc[-1]}")
```
Descriptive Statistics:
Rented Bike Count Hour Temperature(°C) Humidity(%) Wind speed (m/s) Visibility (10m) Dew point temperature(°C) Solar Radiation (MJ/m2) Rainfall(mm) Snowfall (cm)
count 8760.00 8760.00 8760.00 8760.00 8760.00 8760.00 8760.00 8760.00 8760.00 8760.00
mean 704.60 11.50 12.88 58.23 1.72 1436.83 4.07 0.57 0.15 0.08
std 645.00 6.92 11.94 20.36 1.04 608.30 13.06 0.87 1.13 0.44
min 0.00 0.00 -17.80 0.00 0.00 27.00 -30.60 0.00 0.00 0.00
25% 191.00 5.75 3.50 42.00 0.90 940.00 -4.70 0.00 0.00 0.00
50% 504.50 11.50 13.70 57.00 1.50 1698.00 5.10 0.01 0.00 0.00
75% 1065.25 17.25 22.50 74.00 2.30 2000.00 14.80 0.93 0.00 0.00
max 3556.00 23.00 39.40 98.00 7.40 2000.00 27.20 3.52 35.00 8.80
Unique Seasons: ['Winter', 'Spring', 'Summer', 'Autumn']
Unique Holiday: ['No Holiday', 'Holiday']
Unique Functioning Day: ['Yes', 'No']
Date range: 01/12/2017 to 30/11/2018

```python
# Target Distribution + Box-Cox Normalization
y_raw = df['Rented Bike Count']
y_positive = y_raw + 1  # Box-Cox requires strictly positive values
y_boxcox, bc_lambda = stats.boxcox(y_positive)

print(f"Box-Cox optimal lambda: {bc_lambda:.4f}")
print(f"Original -- Skewness: {y_raw.skew():.4f}, Kurtosis: {y_raw.kurtosis():.4f}")
print(f"Box-Cox -- Skewness: {pd.Series(y_boxcox).skew():.4f}, Kurtosis: {pd.Series(y_boxcox).kurtosis():.4f}")
print(f"Log1p -- Skewness: {np.log1p(y_raw).skew():.4f}, Kurtosis: {np.log1p(y_raw).kurtosis():.4f}")

fig, axes = plt.subplots(1, 3, figsize=(18, 5))
axes[0].hist(y_raw, bins=50, color='#64ffda', alpha=0.8)
axes[0].set_title('Original Distribution')
axes[0].axvline(y_raw.mean(), color='red', linestyle='--', label=f'Mean: {y_raw.mean():.0f}')
axes[0].axvline(y_raw.median(), color='orange', linestyle='--', label=f'Median: {y_raw.median():.0f}')
axes[0].legend()
axes[1].hist(y_boxcox, bins=50, color='#e07070', alpha=0.8)
axes[1].set_title(f'Box-Cox Transformed (lambda={bc_lambda:.2f})')
axes[2].hist(np.log1p(y_raw), bins=50, color='#e5c07b', alpha=0.8)
axes[2].set_title('Log(1+x) Transformed')
plt.tight_layout()
plt.show()
```
Box-Cox optimal lambda: 0.3495
Original -- Skewness: 1.1534, Kurtosis: 0.8534
Box-Cox -- Skewness: -0.1274, Kurtosis: -0.5611
Log1p -- Skewness: -1.8322, Kurtosis: 4.3387
```python
# Average bike rentals by hour
hourly = df.groupby('Hour')['Rented Bike Count'].agg(['mean', 'std', 'median'])
print("Hourly Statistics:")
print(hourly.round(1).to_string())

fig, ax = plt.subplots(figsize=(12, 5))
ax.bar(hourly.index, hourly['mean'], color='#64ffda', alpha=0.85)
ax.errorbar(hourly.index, hourly['mean'], yerr=hourly['std'], fmt='none',
            ecolor='gray', alpha=0.4, capsize=2)
ax.set_title('Average Bike Rentals by Hour (with Std Dev)')
ax.set_xlabel('Hour of Day')
ax.set_ylabel('Average Rentals')
ax.set_xticks(range(0, 24))
plt.tight_layout()
plt.show()
```
Hourly Statistics:
mean std median
Hour
0 541.5 364.6 513.0
1 426.2 285.5 401.0
2 301.6 210.1 265.0
3 203.3 143.2 176.0
4 132.6 90.3 119.0
5 139.1 95.5 129.0
6 287.6 222.8 232.0
7 606.0 482.3 426.0
8 1015.7 761.6 728.0
9 646.0 398.6 680.0
10 527.8 323.0 581.0
11 600.9 362.0 624.0
12 699.4 430.7 709.0
13 733.2 457.7 727.0
14 758.8 488.9 733.0
15 829.2 546.5 785.0
16 930.6 618.0 911.0
17 1138.5 748.9 1184.0
18 1502.9 1029.3 1548.0
19 1195.1 857.4 1224.0
20 1069.0 793.9 1062.0
21 1031.4 753.6 1046.0
22 922.8 660.8 949.0
23 671.1 478.8 656.0

```python
# Seasonal pattern
season_order = ['Spring', 'Summer', 'Autumn', 'Winter']
seasonal = df.groupby('Seasons')['Rented Bike Count'].agg(['mean', 'std', 'count']).reindex(season_order)
print("Seasonal Statistics:")
print(seasonal.round(1).to_string())

fig, ax = plt.subplots(figsize=(8, 5))
bars = ax.bar(seasonal.index, seasonal['mean'],
              color=['#64ffda', '#e07070', '#e5c07b', '#6a9a8a'], alpha=0.85)
ax.set_title('Average Bike Rentals by Season')
ax.set_ylabel('Average Rentals')
for bar, val in zip(bars, seasonal['mean']):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 15, f'{val:.0f}', ha='center')
plt.tight_layout()
plt.show()
```
Seasonal Statistics:
mean std count
Seasons
Spring 730.0 621.5 2208
Summer 1034.1 690.2 2208
Autumn 819.6 651.1 2184
Winter 225.5 150.4 2160

```python
# Temperature vs Rentals scatter
corr_temp = df['Temperature(°C)'].corr(df['Rented Bike Count'])
print(f"Pearson correlation (Temperature vs Rentals): {corr_temp:.4f}")

fig, ax = plt.subplots(figsize=(10, 6))
scatter = ax.scatter(df['Temperature(°C)'], df['Rented Bike Count'],
                     c=df['Hour'], cmap='cool', alpha=0.3, s=8)
plt.colorbar(scatter, ax=ax, label='Hour of Day')
ax.set_title('Temperature vs Bike Rentals (colored by Hour)')
ax.set_xlabel('Temperature (°C)')
ax.set_ylabel('Rented Bike Count')
plt.tight_layout()
plt.show()
```
Pearson correlation (Temperature vs Rentals): 0.5386
```python
# Boxplots: Season & Holiday
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
season_data = [df[df['Seasons'] == s]['Rented Bike Count'].values for s in season_order]
axes[0].boxplot(season_data, labels=season_order, patch_artist=True,
                boxprops=dict(facecolor='#64ffda', alpha=0.6))
axes[0].set_title('Rental Distribution by Season')
holiday_data = [df[df['Holiday'] == h]['Rented Bike Count'].values for h in ['No Holiday', 'Holiday']]
axes[1].boxplot(holiday_data, labels=['Working Day', 'Holiday'], patch_artist=True,
                boxprops=dict(facecolor='#e07070', alpha=0.6))
axes[1].set_title('Rental Distribution: Working Day vs Holiday')
plt.tight_layout()
plt.show()
```
```python
# Full Pairwise Scatter Plot (all numeric features)
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
print(f"Numeric features for pairplot ({len(numeric_cols)}): {numeric_cols}")
corr_matrix = df[numeric_cols].corr()
print("\nFull Pairwise Correlation Matrix:")
print(corr_matrix.round(4).to_string())

# Pairplot with correlation values
sample = df.sample(n=1500, random_state=42)[numeric_cols]
n = len(numeric_cols)
fig, axes = plt.subplots(n, n, figsize=(28, 28))
for i in range(n):
    for j in range(n):
        ax = axes[i][j]
        if i == j:
            ax.hist(sample[numeric_cols[i]], bins=25, color='#64ffda', alpha=0.7)
        else:
            ax.scatter(sample[numeric_cols[j]], sample[numeric_cols[i]], s=1.5, alpha=0.35)
            r = corr_matrix.iloc[i, j]
            ax.text(0.95, 0.95, f'{r:.2f}', transform=ax.transAxes,
                    ha='right', va='top', fontsize=6, fontweight='bold')
        if i == n - 1:
            ax.set_xlabel(numeric_cols[j], fontsize=5, rotation=45, ha='right')
        if j == 0:
            ax.set_ylabel(numeric_cols[i], fontsize=5)
fig.suptitle('Pairwise Feature Relationships', fontsize=16)
plt.tight_layout()
plt.show()
```
Numeric features for pairplot (10): ['Rented Bike Count', 'Hour', 'Temperature(°C)', 'Humidity(%)', 'Wind speed (m/s)', 'Visibility (10m)', 'Dew point temperature(°C)', 'Solar Radiation (MJ/m2)', 'Rainfall(mm)', 'Snowfall (cm)']
Full Pairwise Correlation Matrix:
Rented Bike Count Hour Temperature(°C) Humidity(%) Wind speed (m/s) Visibility (10m) Dew point temperature(°C) Solar Radiation (MJ/m2) Rainfall(mm) Snowfall (cm)
Rented Bike Count 1.0000 0.4103 0.5386 -0.1998 0.1211 0.1993 0.3798 0.2618 -0.1231 -0.1418
Hour 0.4103 1.0000 0.1241 -0.2416 0.2852 0.0988 0.0031 0.1451 0.0087 -0.0215
Temperature(°C) 0.5386 0.1241 1.0000 0.1594 -0.0363 0.0348 0.9128 0.3535 0.0503 -0.2184
Humidity(%) -0.1998 -0.2416 0.1594 1.0000 -0.3367 -0.5431 0.5369 -0.4619 0.2364 0.1082
Wind speed (m/s) 0.1211 0.2852 -0.0363 -0.3367 1.0000 0.1715 -0.1765 0.3323 -0.0197 -0.0036
Visibility (10m) 0.1993 0.0988 0.0348 -0.5431 0.1715 1.0000 -0.1766 0.1497 -0.1676 -0.1217
Dew point temperature(°C) 0.3798 0.0031 0.9128 0.5369 -0.1765 -0.1766 1.0000 0.0944 0.1256 -0.1509
Solar Radiation (MJ/m2) 0.2618 0.1451 0.3535 -0.4619 0.3323 0.1497 0.0944 1.0000 -0.0743 -0.0723
Rainfall(mm) -0.1231 0.0087 0.0503 0.2364 -0.0197 -0.1676 0.1256 -0.0743 1.0000 0.0085
Snowfall (cm) -0.1418 -0.0215 -0.2184 0.1082 -0.0036 -0.1217 -0.1509 -0.0723 0.0085 1.0000

```python
# Complete Correlation Heatmap (full matrix, no mask)
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
corr = df[numeric_cols].corr()
target_corr = corr['Rented Bike Count'].drop('Rented Bike Count').sort_values(key=abs, ascending=False)
print("Correlations with Rented Bike Count (sorted by |r|):")
for feat, r in target_corr.items():
    bar = chr(9608) * int(abs(r) * 30)
    sign = '+' if r > 0 else '-'
    print(f"  {feat:35s} {sign}{abs(r):.4f} {bar}")

fig, ax = plt.subplots(figsize=(14, 11))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='RdYlGn', center=0, ax=ax,
            square=True, linewidths=0.5, vmin=-1, vmax=1,
            cbar_kws={'shrink': 0.8}, annot_kws={'size': 8})
ax.set_title('Complete Correlation Heatmap (All Features)')
plt.tight_layout()
plt.show()
```
Correlations with Rented Bike Count (sorted by |r|):
  Temperature(°C)             +0.5386 ████████████████
  Hour                        +0.4103 ████████████
  Dew point temperature(°C)   +0.3798 ███████████
  Solar Radiation (MJ/m2)     +0.2618 ███████
  Humidity(%)                 -0.1998 █████
  Visibility (10m)            +0.1993 █████
  Snowfall (cm)               -0.1418 ████
  Rainfall(mm)                -0.1231 ███
  Wind speed (m/s)            +0.1211 ███
```python
# Data Preprocessing & Feature Engineering
n_before = len(df)
df_model = df[df['Functioning Day'] == 'Yes'].copy()
print(f"After filtering Functioning Day = Yes: {len(df_model)} rows (removed {n_before - len(df_model)})")

df_model['Date'] = pd.to_datetime(df_model['Date'], format='%d/%m/%Y')
df_model['Month'] = df_model['Date'].dt.month
df_model['DayOfWeek'] = df_model['Date'].dt.dayofweek
df_model['IsWeekend'] = (df_model['DayOfWeek'] >= 5).astype(int)
df_model['IsRushHour'] = df_model['Hour'].apply(lambda x: 1 if x in [7, 8, 9, 17, 18, 19] else 0)

le_season = LabelEncoder()
df_model['Seasons_enc'] = le_season.fit_transform(df_model['Seasons'])
print(f"Season encoding: {dict(zip(le_season.classes_, le_season.transform(le_season.classes_)))}")
df_model['Holiday_enc'] = (df_model['Holiday'] == 'Holiday').astype(int)

feature_cols = [
    'Hour', 'Temperature(°C)', 'Humidity(%)', 'Wind speed (m/s)',
    'Visibility (10m)', 'Dew point temperature(°C)', 'Solar Radiation (MJ/m2)',
    'Rainfall(mm)', 'Snowfall (cm)', 'Seasons_enc', 'Holiday_enc',
    'Month', 'DayOfWeek', 'IsWeekend', 'IsRushHour'
]
X = df_model[feature_cols]
y = df_model['Rented Bike Count']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Features ({len(feature_cols)}): {feature_cols}")
print(f"Train set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")
print(f"Scaler means: {dict(zip(feature_cols, np.round(scaler.mean_, 2)))}")
```
After filtering Functioning Day = Yes: 8465 rows (removed 295)
Season encoding: {'Autumn': np.int64(0), 'Spring': np.int64(1), 'Summer': np.int64(2), 'Winter': np.int64(3)}
Features (15): ['Hour', 'Temperature(°C)', 'Humidity(%)', 'Wind speed (m/s)', 'Visibility (10m)', 'Dew point temperature(°C)', 'Solar Radiation (MJ/m2)', 'Rainfall(mm)', 'Snowfall (cm)', 'Seasons_enc', 'Holiday_enc', 'Month', 'DayOfWeek', 'IsWeekend', 'IsRushHour']
Train set: 6772 samples
Test set: 1693 samples
Scaler means: {'Hour': np.float64(11.48), 'Temperature(°C)': np.float64(12.79), 'Humidity(%)': np.float64(58.32), 'Wind speed (m/s)': np.float64(1.73), 'Visibility (10m)': np.float64(1429.69), 'Dew point temperature(°C)': np.float64(4.0), 'Solar Radiation (MJ/m2)': np.float64(0.57), 'Rainfall(mm)': np.float64(0.15), 'Snowfall (cm)': np.float64(0.08), 'Seasons_enc': np.float64(1.54), 'Holiday_enc': np.float64(0.05), 'Month': np.float64(6.43), 'DayOfWeek': np.float64(3.02), 'IsWeekend': np.float64(0.29), 'IsRushHour': np.float64(0.25)}

```python
# Train & Evaluate All Models
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0),
    'Lasso Regression': Lasso(alpha=1.0),
    'K-Nearest Neighbors': KNeighborsRegressor(n_neighbors=5),
    'Decision Tree': DecisionTreeRegressor(max_depth=12, random_state=42),
    'Random Forest': RandomForestRegressor(n_estimators=200, max_depth=15, random_state=42, n_jobs=-1),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=300, max_depth=6, learning_rate=0.1, random_state=42),
    'AdaBoost': AdaBoostRegressor(n_estimators=150, learning_rate=0.1, random_state=42),
    'SVR': SVR(kernel='rbf', C=100, epsilon=0.1),
}
scale_models = {'Linear Regression', 'Ridge Regression', 'Lasso Regression',
                'K-Nearest Neighbors', 'SVR'}

results = {}
for name, model in models.items():
    if name in scale_models:
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)
        cv = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='neg_mean_squared_error')
    else:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        cv = cross_val_score(model, X_train.values, y_train.values, cv=5, scoring='neg_mean_squared_error')
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    cv_rmse = np.sqrt(-cv.mean())
    results[name] = {'RMSE': rmse, 'MAE': mae, 'R2': r2, 'CV_RMSE': cv_rmse}
    print(f"{name:<25s} RMSE: {rmse:.2f}  MAE: {mae:.2f}  R2: {r2:.4f}  CV-RMSE: {cv_rmse:.2f}")
```
Model                     RMSE     MAE      R²      CV-RMSE
--------------------------------------------------------------
Linear Regression         395.25   302.24   0.6021  406.51
Ridge Regression          395.24   302.24   0.6021  406.50
Lasso Regression          395.04   302.19   0.6025  406.53
K-Nearest Neighbors       252.95   167.79   0.8370  280.42
Decision Tree             240.89   131.22   0.8522  230.12
Random Forest             172.98   95.78    0.9238  168.98
Gradient Boosting         160.73   86.53    0.9342  147.41
AdaBoost                  310.21   238.83   0.7549  320.63
SVR                       256.50   165.93   0.8324  279.66

>>> Best model: Gradient Boosting (RMSE=160.73, R²=0.9342)
```python
# Model Comparison Charts
model_names = list(results.keys())
rmse_vals = [results[m]['RMSE'] for m in model_names]
mae_vals = [results[m]['MAE'] for m in model_names]
r2_vals = [results[m]['R2'] for m in model_names]

fig, axes = plt.subplots(1, 3, figsize=(18, 6))
for ax, vals, title, best_fn in [
    (axes[0], rmse_vals, 'RMSE (lower is better)', np.argmin),
    (axes[1], mae_vals, 'MAE (lower is better)', np.argmin),
    (axes[2], r2_vals, 'R2 Score (higher is better)', np.argmax),
]:
    sorted_idx = np.argsort(vals) if 'lower' in title else np.argsort(vals)[::-1]
    best_i = best_fn(vals)
    colors = ['#64ffda' if i == best_i else '#4a4a5a' for i in range(len(model_names))]
    ax.barh([model_names[i] for i in sorted_idx], [vals[i] for i in sorted_idx],
            color=[colors[i] for i in sorted_idx], alpha=0.85)
    ax.set_title(title)
plt.tight_layout()
plt.show()
```
```python
# Actual vs Predicted -- Top 3 Models
top3 = sorted(results, key=lambda k: results[k]['RMSE'])[:3]
fig, axes = plt.subplots(1, 3, figsize=(18, 5.5))
for ax, name in zip(axes, top3):
    y_pred = models[name].predict(X_test_scaled if name in scale_models else X_test)
    ax.scatter(y_test, y_pred, alpha=0.25, s=8, color='#64ffda')
    max_val = max(y_test.max(), y_pred.max())
    ax.plot([0, max_val], [0, max_val], '--', color='red', linewidth=1.5, label='Perfect')
    ax.set_title(f'{name}\nR2 = {results[name]["R2"]:.4f}')
    ax.set_xlabel('Actual')
    ax.set_ylabel('Predicted')
    ax.legend(fontsize=8)
plt.tight_layout()
plt.show()
```
```python
# Feature Importance -- Gradient Boosting
gb = models['Gradient Boosting']
importance = pd.DataFrame({
    'feature': feature_cols,
    'importance': gb.feature_importances_
}).sort_values('importance', ascending=False)

print("Feature Importance (Gradient Boosting):")
for _, row in importance.iterrows():
    bar = chr(9608) * int(row['importance'] * 60)
    print(f"  {row['feature']:35s} {row['importance']:.4f} {bar}")

importance_sorted = importance.sort_values('importance', ascending=True)
fig, ax = plt.subplots(figsize=(10, 7))
ax.barh(importance_sorted['feature'], importance_sorted['importance'],
        color=['#64ffda' if v >= importance['importance'].quantile(0.75) else '#4a4a5a'
               for v in importance_sorted['importance'].values])
ax.set_title('Feature Importance (Gradient Boosting)')
ax.set_xlabel('Importance')
plt.tight_layout()
plt.show()
```
Feature Importance (Gradient Boosting):
  Temperature(°C)             0.3379 ████████████████████
  Hour                        0.2866 █████████████████
  Humidity(%)                 0.0823 ████
  Solar Radiation (MJ/m2)     0.0729 ████
  Rainfall(mm)                0.0452 ██
  IsRushHour                  0.0352 ██
  Seasons_enc                 0.0321 █
  DayOfWeek                   0.0253 █
  IsWeekend                   0.0238 █
  Dew point temperature(°C)   0.0237 █
  Month                       0.0166
  Visibility (10m)            0.0068
  Holiday_enc                 0.0067
  Wind speed (m/s)            0.0045
  Snowfall (cm)               0.0005
```python
# Residual Analysis -- Best Model
best_name = min(results, key=lambda k: results[k]['RMSE'])
best_pred = models[best_name].predict(X_test)
residuals = y_test.values - best_pred

print(f"Residual Analysis for {best_name}:")
print(f"  Mean residual: {residuals.mean():.2f}")
print(f"  Std residual: {residuals.std():.2f}")
print(f"  Min residual: {residuals.min():.2f}")
print(f"  Max residual: {residuals.max():.2f}")

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].scatter(best_pred, residuals, alpha=0.25, s=8, color='#64ffda')
axes[0].axhline(y=0, color='red', linestyle='--')
axes[0].set_title(f'Residual Plot ({best_name})')
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Residual')
axes[1].hist(residuals, bins=50, color='#64ffda', alpha=0.8)
axes[1].axvline(x=0, color='red', linestyle='--')
axes[1].set_title('Residual Distribution')
plt.tight_layout()
plt.show()
```
Residual Analysis for Gradient Boosting:
  Mean residual: -4.72
  Std residual: 160.66
  Min residual: -2309.85
  Max residual: 1331.65
```python
# Feature Importance Comparison: Random Forest vs Gradient Boosting
rf = models['Random Forest']
comp = pd.DataFrame({
    'feature': feature_cols,
    'Random Forest': rf.feature_importances_,
    'Gradient Boosting': gb.feature_importances_
}).sort_values('Gradient Boosting', ascending=True)

print("Feature Importance Comparison (RF vs GB):")
print(comp.sort_values('Gradient Boosting', ascending=False).to_string(index=False))

fig, ax = plt.subplots(figsize=(10, 7))
y_pos = np.arange(len(comp))
ax.barh(y_pos - 0.2, comp['Random Forest'].values, 0.4, label='Random Forest', color='#64ffda')
ax.barh(y_pos + 0.2, comp['Gradient Boosting'].values, 0.4, label='Gradient Boosting', color='#e07070')
ax.set_yticks(y_pos)
ax.set_yticklabels(comp['feature'].values)
ax.set_title('Feature Importance: RF vs Gradient Boosting')
ax.set_xlabel('Importance')
ax.legend()
plt.tight_layout()
plt.show()
```
Feature Importance Comparison (RF vs GB):
feature Random Forest Gradient Boosting
Temperature(°C) 0.344134 0.337856
Hour 0.276564 0.286580
Humidity(%) 0.084948 0.082297
Solar Radiation (MJ/m2) 0.084492 0.072897
Rainfall(mm) 0.036274 0.045226
IsRushHour 0.037281 0.035177
Seasons_enc 0.031611 0.032062
DayOfWeek 0.025238 0.025285
IsWeekend 0.018051 0.023807
Dew point temperature(°C) 0.022503 0.023723
Month 0.013933 0.016595
Visibility (10m) 0.009160 0.006811
Holiday_enc 0.005820 0.006729
Wind speed (m/s) 0.009347 0.004477
Snowfall (cm)            0.000646          0.000479

Gradient Boosting outperformed all other models with an R² of 0.9342, followed closely by Random Forest (R² = 0.9238). Linear models performed poorly (R² around 0.60), confirming that the relationship between features and demand is highly nonlinear. Temperature and Hour of Day are the dominant predictors, together accounting for over 60% of the model's feature importance.
The Box-Cox transformation with an optimal lambda of 0.35 effectively normalized the right-skewed target distribution, reducing skewness from 1.15 to -0.13. The full pairwise analysis revealed strong multicollinearity between Temperature and Dew Point (r = 0.91), which is worth noting for linear model interpretation but is handled gracefully by tree-based methods.
These results suggest that bike-sharing operators should prioritize temperature forecasts and time-of-day scheduling when planning bike redistribution. The IsRushHour and Humidity features provide additional predictive value for fine-tuning supply during peak periods.